Import Necessary Libraries:
Load and Explore the Data:
Data Preprocessing:
Exploratory Data Analysis (EDA):
Train-Test Split:
Build and Train the Logistic Regression Model:
Evaluate the Model:
Interpret the Results:
The HBFC Loan Customer Analysis Project involved developing a logistic regression model to identify potential loan customers, providing valuable insights for strategic decision-making and optimizing loan offerings.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
Now, let's provide a brief explanation for each import statement:
pandas (pd): data loading and manipulation with DataFrames.
numpy (np): numerical computing and array operations.
seaborn (sns): statistical data visualization built on top of Matplotlib.
matplotlib.pyplot (plt): the plotting interface used to configure and display figures.
Scikit-learn: the machine learning toolkit that provides the remaining imports.
LabelEncoder: encodes categorical labels as integers.
train_test_split: splits the data into training and testing sets.
confusion_matrix, classification_report, accuracy_score: metrics for evaluating the model.
LogisticRegression: the classification model used in this project.
Reading data from the Excel file into a Pandas DataFrame named df. The 'sheet_name' parameter specifies which sheet to read (if there are multiple sheets).
df=pd.read_excel(r'C:\Users\ASUS\Downloads\HBFC Bank.xlsx',sheet_name='Bank_Personal_Loan_Modelling')
Displaying the first 5 rows of the DataFrame to inspect the data
df.head(5)
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Personal Loan | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 67 | 41 | 112 | 100+ | 91741 | 1.0 | 2.0 | Undergraduate | 0 | No | Yes | No | No | No |
| 1 | 1481 | 67 | 42 | 32 | 0-50 | 93943 | 1.0 | 1.1 | Professional | 0 | No | No | No | No | Yes |
| 2 | 1860 | 67 | 41 | 20 | 0-50 | 91741 | 2.0 | 0.4 | Undergraduate | 80 | No | No | No | No | No |
| 3 | 2847 | 67 | 43 | 105 | 100+ | 93711 | 4.0 | 1.7 | Graduate | 0 | No | No | No | Yes | No |
| 4 | 3265 | 67 | 41 | 114 | 100+ | 95616 | 4.0 | 2.4 | Professional | 0 | No | No | No | Yes | No |
Displaying the last 5 rows of the DataFrame to inspect the data
df.tail(5)
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Personal Loan | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 3158 | 23 | 1 | 13 | 0-50 | 94720 | 4.0 | 1.0 | Undergraduate | 84 | No | No | No | Yes | No |
| 4996 | 3426 | 23 | 1 | 12 | 0-50 | 91605 | 4.0 | 1.0 | Undergraduate | 90 | No | No | No | Yes | No |
| 4997 | 3825 | 23 | 1 | 12 | 0-50 | 95064 | 4.0 | 1.0 | Undergraduate | 0 | No | Yes | No | No | Yes |
| 4998 | 4286 | 23 | 3 | 149 | 100+ | 93555 | 2.0 | 7.2 | Undergraduate | 0 | No | No | No | Yes | No |
| 4999 | 4412 | 23 | 2 | 75 | 51-100 | 90291 | 2.0 | 1.8 | Graduate | 0 | No | No | No | Yes | Yes |
Below are the important columns (variables) in the DataFrame to consider.
Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It involves cleaning and transforming raw data into a format that is suitable for analysis or model training. The goal of preprocessing is to enhance the quality of the data.
Data cleaning is one of the steps involved in data preprocessing. It typically covers the following tasks:
Handle missing values: impute or remove missing data. If there are no missing values, this task can be skipped.
Correct inconsistencies: resolve errors and inconsistencies in the data, if present.
Handle outliers: identify and address outliers. Below we check for outliers in our DataFrame.
# Check for missing values
df.isnull().sum()
ID                       0
Age (in years)           0
Experience (in years)    0
Income (in K/year)       0
Income Categorical       0
ZIP Code                 0
Family members          18
CCAvg                    0
Education                0
Mortgage                 0
Personal Loan            0
Securities Account       0
TD Account               0
Online                   0
CreditCard               0
dtype: int64
As we can see, there are no missing values in the columns we flagged as important earlier, but one of the other columns, 'Family members', has 18 missing values.
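The 18 missing 'Family members' values could be imputed rather than left untouched. A minimal sketch on a toy frame, using median imputation (this choice is an assumption; the notebook itself does not impute these values):

```python
import pandas as pd

# Toy frame standing in for df: 'Family members' has missing entries
sample = pd.DataFrame({"Family members": [1.0, 2.0, None, 4.0, None, 2.0]})

# Impute missing values with the column median (a common, outlier-robust choice)
median_value = sample["Family members"].median()
sample["Family members"] = sample["Family members"].fillna(median_value)

print(sample["Family members"].isnull().sum())  # 0 missing values remain
```

Mean imputation or dropping the 18 rows would also be defensible here, since they are only 0.36% of the 5000 records.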
df.columns
Index(['ID', 'Age (in years)', 'Experience (in years)', 'Income (in K/year)',
'Income Categorical', 'ZIP Code', 'Family members', 'CCAvg',
'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
'TD Account', 'Online', 'CreditCard'],
dtype='object')
An example of removing unwanted variables or columns, if the dataset had contained any:
## df.drop(columns=['Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'COMBINATION OF TD & CC','COMBINATION OF TD & CC, Person who don’t have personal loan','Unnamed: 23'],inplace=True)
Checking whether the columns are removed or not.
df.head(5)
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Personal Loan | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 67 | 41 | 112 | 100+ | 91741 | 1.0 | 2.0 | Undergraduate | 0 | No | Yes | No | No | No |
| 1 | 1481 | 67 | 42 | 32 | 0-50 | 93943 | 1.0 | 1.1 | Professional | 0 | No | No | No | No | Yes |
| 2 | 1860 | 67 | 41 | 20 | 0-50 | 91741 | 2.0 | 0.4 | Undergraduate | 80 | No | No | No | No | No |
| 3 | 2847 | 67 | 43 | 105 | 100+ | 93711 | 4.0 | 1.7 | Graduate | 0 | No | No | No | Yes | No |
| 4 | 3265 | 67 | 41 | 114 | 100+ | 95616 | 4.0 | 2.4 | Professional | 0 | No | No | No | Yes | No |
df.shape
(5000, 15)
df.describe()
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | ZIP Code | Family members | CCAvg | Mortgage |
|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 4982.00000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 45.338400 | 20.13480 | 73.774200 | 93152.503000 | 2.39723 | 1.937938 | 56.498800 |
| std | 1443.520003 | 11.463166 | 11.41488 | 46.033729 | 2121.852197 | 1.14716 | 1.747659 | 101.713802 |
| min | 1.000000 | 23.000000 | 0.00000 | 8.000000 | 9307.000000 | 1.00000 | 0.000000 | 0.000000 |
| 25% | 1250.750000 | 35.000000 | 10.00000 | 39.000000 | 91911.000000 | 1.00000 | 0.700000 | 0.000000 |
| 50% | 2500.500000 | 45.000000 | 20.00000 | 64.000000 | 93437.000000 | 2.00000 | 1.500000 | 0.000000 |
| 75% | 3750.250000 | 55.000000 | 30.00000 | 98.000000 | 94608.000000 | 3.00000 | 2.500000 | 101.000000 |
| max | 5000.000000 | 67.000000 | 43.00000 | 224.000000 | 96651.000000 | 4.00000 | 10.000000 | 635.000000 |
df[df.duplicated()]
| ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Personal Loan | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Above command is used to identify and display duplicated rows in a DataFrame. It checks for duplicated rows based on all columns and returns a DataFrame containing the duplicated rows. If there are no duplicates, it will return an empty DataFrame as you can see above.
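Had df.duplicated() flagged any rows, they could be removed in one step with drop_duplicates. A small sketch on a toy frame (the data here is made up):

```python
import pandas as pd

# Toy frame containing one exact duplicate row
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "Income": [40, 55, 55, 80]})

deduped = toy.drop_duplicates()  # keeps the first occurrence of each duplicated row
print(len(deduped))  # 3 rows remain
```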
df1 = df.copy()  # Keep a backup copy of the dataset (optional).
Exploratory Data Analysis (EDA) is a crucial step in understanding and analyzing a dataset. It involves summarizing the main characteristics of a dataset, often with the help of statistical graphics and other data visualization methods. Here's a basic outline for performing EDA in a Jupyter Notebook using Python and common libraries like Pandas, Matplotlib, and Seaborn.
The Exploratory Data Analysis (EDA) process and visualization steps involved in analyzing the dataset can be summarized as follows:
Load and Explore Data:
head(), info(), and describe().
Handle Missing Values and Initial Data Cleaning:
Univariate Analysis:
Bivariate Analysis:
Multivariate Analysis:
Statistical Summary:
Feature Engineering and Transformation:
Identify Outliers:
Correlation Analysis:
Final Feature Selection:
Concluding Insights:
Iterative Process:
This process involves a combination of statistical measures and visualizations to gain a comprehensive understanding of the dataset, identify patterns, and inform decisions about feature selection and potential transformations for subsequent modeling steps.
# Display basic information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     5000 non-null   int64
 1   Age (in years)         5000 non-null   int64
 2   Experience (in years)  5000 non-null   int64
 3   Income (in K/year)     5000 non-null   int64
 4   Income Categorical     5000 non-null   object
 5   ZIP Code               5000 non-null   int64
 6   Family members         4982 non-null   float64
 7   CCAvg                  5000 non-null   float64
 8   Education              5000 non-null   object
 9   Mortgage               5000 non-null   int64
 10  Personal Loan          5000 non-null   object
 11  Securities Account     5000 non-null   object
 12  TD Account             5000 non-null   object
 13  Online                 5000 non-null   object
 14  CreditCard             5000 non-null   object
dtypes: float64(2), int64(6), object(7)
memory usage: 586.1+ KB
The above output shows, for each column: the column name, the non-null count (the number of non-missing values), and the data type of its values (int64, float64, or object).
# Display summary statistics for numerical columns
df.describe()
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | ZIP Code | Family members | CCAvg | Mortgage |
|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 4982.00000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 45.338400 | 20.13480 | 73.774200 | 93152.503000 | 2.39723 | 1.937938 | 56.498800 |
| std | 1443.520003 | 11.463166 | 11.41488 | 46.033729 | 2121.852197 | 1.14716 | 1.747659 | 101.713802 |
| min | 1.000000 | 23.000000 | 0.00000 | 8.000000 | 9307.000000 | 1.00000 | 0.000000 | 0.000000 |
| 25% | 1250.750000 | 35.000000 | 10.00000 | 39.000000 | 91911.000000 | 1.00000 | 0.700000 | 0.000000 |
| 50% | 2500.500000 | 45.000000 | 20.00000 | 64.000000 | 93437.000000 | 2.00000 | 1.500000 | 0.000000 |
| 75% | 3750.250000 | 55.000000 | 30.00000 | 98.000000 | 94608.000000 | 3.00000 | 2.500000 | 101.000000 |
| max | 5000.000000 | 67.000000 | 43.00000 | 224.000000 | 96651.000000 | 4.00000 | 10.000000 | 635.000000 |
The df.describe() method is used to generate descriptive statistics of the central tendency, dispersion, and shape of the distribution of each numerical column in your DataFrame. Here's what each of the statistics means:
Count: Number of non-null values. This tells you how many data points you have in each column.
Mean: Average value. It provides an indication of the central tendency of the data.
Std (Standard Deviation): A measure of the amount of variation or dispersion of a set of values. It quantifies how far each data point typically lies from the mean.
Min: The minimum value in the column.
25% (Q1): The first quartile or the 25th percentile. It is the value below which 25% of the data falls.
50% (Q2 or Median): The median or the 50th percentile. It is the middle value of the dataset.
75% (Q3): The third quartile or the 75th percentile. It is the value below which 75% of the data falls.
Max: The maximum value in the column.
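The quartile definitions above can be verified by hand with NumPy. A tiny illustration on five values chosen (as an assumption, for convenience) to echo the Income column's min/25%/50%/75%/max:

```python
import numpy as np

data = np.array([8, 39, 64, 98, 224])  # five sample income values

# np.percentile computes the requested percentiles (linear interpolation by default)
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(q1, median, q3)  # 39.0 64.0 98.0
```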
Data visualization is a powerful tool in exploratory data analysis (EDA) to understand patterns, relationships, and trends within a dataset. We use Python libraries like Matplotlib and Seaborn.
Univariate analysis involves the exploration and analysis of a single variable at a time. The goal is to understand the distribution, central tendency, and variability of that variable. One example is below.
sns.countplot(data=df,x='Education')
<Axes: xlabel='Education', ylabel='count'>
sns.countplot(data=df, x='Education') uses Seaborn to create a countplot for the 'Education' column in our DataFrame. A countplot is a type of bar plot that shows the counts of observations in each category of a categorical variable.
sns.histplot(data=df, x='Age (in years)', hue='Education')  # histplot is also univariate analysis; here we plot the age distribution, colored by education level.
plt.title("Distribution of Age")
Text(0.5, 1.0, 'distribution of age')
numerical_columns = df.select_dtypes(include=[np.number]).columns  # Select only the numerical columns
df_log = np.log1p(df[numerical_columns])  # Log-transform the data for clearer visualization
# Creating box plots of the log-transformed data to spot outliers
plt.figure(figsize=(15, 10))
sns.boxplot(data=df_log, orient='h', palette='Set2') # 'orient' set to 'h' for horizontal box plots
plt.title('Box Plots of Numerical Columns (Log-Transformed)')
plt.show()
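Box plots flag points beyond the whiskers visually; the same rule can be computed numerically with the standard 1.5 x IQR fence. A sketch on a hypothetical column of CCAvg-like values (the numbers are made up):

```python
import numpy as np

values = np.array([0.1, 0.4, 1.1, 1.5, 1.7, 2.4, 2.5, 10.0])  # sample spending values

# Outliers lie outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [10.]
```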
The LabelEncoder is imported from scikit-learn to encode categorical variables in our DataFrame (df). This is a common preprocessing step when working with machine learning algorithms that require numerical input. The LabelEncoder transforms categorical labels into numerical labels.
le = LabelEncoder()
a = df.select_dtypes('object')  # Select columns with the 'object' dtype, as these are the categorical columns
for x in a.columns:
    df[x] = le.fit_transform(df[x])  # Transform the values in each selected column from categorical to numerical
df.head()
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Personal Loan | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 67 | 41 | 112 | 1 | 91741 | 1.0 | 2.0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1481 | 67 | 42 | 32 | 0 | 93943 | 1.0 | 1.1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1860 | 67 | 41 | 20 | 0 | 91741 | 2.0 | 0.4 | 2 | 80 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2847 | 67 | 43 | 105 | 1 | 93711 | 4.0 | 1.7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 3265 | 67 | 41 | 114 | 1 | 95616 | 4.0 | 2.4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
plt.figure(figsize=(15, 10))  # Set the figure size to 15 inches wide by 10 inches tall
# df.corr() computes the correlation matrix; annot=True writes the correlation values on the cells; cmap='winter' sets the color scheme.
sns.heatmap(df.corr(), annot=True, cmap='winter')
<Axes: >
df.head(20)
| | ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Personal Loan | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 67 | 41 | 112 | 1 | 91741 | 1.0 | 2.00 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1481 | 67 | 42 | 32 | 0 | 93943 | 1.0 | 1.10 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1860 | 67 | 41 | 20 | 0 | 91741 | 2.0 | 0.40 | 2 | 80 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2847 | 67 | 43 | 105 | 1 | 93711 | 4.0 | 1.70 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 3265 | 67 | 41 | 114 | 1 | 95616 | 4.0 | 2.40 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 3332 | 67 | 42 | 21 | 0 | 94607 | 3.0 | 0.10 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 3704 | 67 | 41 | 78 | 2 | 94301 | 4.0 | 2.40 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | 3887 | 67 | 43 | 79 | 2 | 95616 | 4.0 | 1.70 | 0 | 215 | 0 | 0 | 1 | 1 | 1 |
| 8 | 4173 | 67 | 42 | 75 | 2 | 90041 | 4.0 | 0.10 | 0 | 182 | 0 | 0 | 0 | 1 | 0 |
| 9 | 4361 | 67 | 43 | 41 | 0 | 90024 | 2.0 | 1.10 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 4452 | 67 | 41 | 18 | 0 | 92130 | 2.0 | 0.40 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 11 | 4469 | 67 | 42 | 51 | 2 | 94117 | 3.0 | 2.20 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 12 | 100 | 66 | 41 | 15 | 0 | 91711 | 3.0 | 0.10 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 13 | 256 | 66 | 40 | 42 | 0 | 92103 | 2.0 | 0.70 | 1 | 138 | 0 | 0 | 0 | 0 | 1 |
| 14 | 258 | 66 | 41 | 18 | 0 | 92691 | 3.0 | 0.50 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 466 | 66 | 42 | 35 | 0 | 94305 | 1.0 | 1.90 | 0 | 172 | 0 | 0 | 0 | 1 | 0 |
| 16 | 669 | 66 | 41 | 18 | 0 | 94010 | 3.0 | 0.50 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 735 | 66 | 42 | 53 | 2 | 92182 | 2.0 | 1.10 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 18 | 909 | 66 | 36 | 55 | 2 | 93023 | 4.0 | 1.67 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 19 | 1232 | 66 | 41 | 144 | 1 | 94306 | 1.0 | 2.50 | 2 | 0 | 0 | 1 | 1 | 1 | 1 |
The goal of EDA is to gain insights into the underlying patterns, distributions, and relationships within the data, helping to inform subsequent steps in the data analysis or modeling process.
Below, we generate a variety of visualizations, including histograms, box plots, and scatter plots, to explore the relationship between different features and the likelihood of taking a personal loan (the 'Personal Loan' column). These visualizations can reveal distributions and patterns within the data.
sns.boxplot(data=df,x='Age (in years)')
plt.show()
sns.scatterplot(data=df,x='Age (in years)',y='Experience (in years)')
plt.show()
sns.histplot(data=df,x='Age (in years)')
plt.show()
Online=sns.FacetGrid(df,col='Personal Loan')
Online.map(plt.hist,'Online',bins=20)
plt.show()
income=sns.FacetGrid(df,col='Personal Loan')
income.map(plt.hist,'Income (in K/year)',bins=20)
plt.show()
From the above graph, it is evident that customers with lower incomes have generally not taken out loans.
cc=sns.FacetGrid(df,col='Personal Loan')
cc.map(plt.hist,'CCAvg',bins=20)
plt.show()
Customers with lower average credit card spending (CCAvg) have generally not taken out loans.
edu=sns.FacetGrid(df,col='Personal Loan')
edu.map(plt.hist,'Education',bins=20)
plt.show()
Here, '2' represents 'Undergraduate'. We can observe that individuals categorized as undergraduates have mostly not taken out loans.
td=sns.FacetGrid(df,col='Personal Loan')
td.map(plt.hist,'TD Account',bins=20)
plt.show()
Customers without a TD account have not taken a loan.
mortgage = sns.FacetGrid(df, col='Personal Loan')
mortgage.map(plt.hist, 'Mortgage', bins=5)
plt.show()
Here we use the groupby function to group the DataFrame df by the 'Personal Loan' column and count the number of occurrences in each group. A value of 0 means the customer has not taken a personal loan and 1 means they have.
df.groupby(['Personal Loan']).count()
| Personal Loan | ID | Age (in years) | Experience (in years) | Income (in K/year) | Income Categorical | ZIP Code | Family members | CCAvg | Education | Mortgage | Securities Account | TD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4520 | 4520 | 4520 | 4520 | 4520 | 4520 | 4504 | 4520 | 4520 | 4520 | 4520 | 4520 | 4520 | 4520 |
| 1 | 480 | 480 | 480 | 480 | 480 | 480 | 478 | 480 | 480 | 480 | 480 | 480 | 480 | 480 |
sns.pairplot(df,hue='Personal Loan',diag_kind='kde')
plt.show()
Here sns.pairplot(df, hue='Personal Loan', diag_kind='kde') uses the seaborn library to create a pair plot for the DataFrame df. A pair plot is a grid of scatterplots showing relationships between pairs of variables; when the points are colored by a categorical variable (in this case, 'Personal Loan'), it helps visualize how different features vary together with the target variable.
Here's what the pair plot shows:
The scatterplots in the lower triangle show the relationship between pairs of numerical features. Each point represents a data point, and the color of the points is based on the 'Personal Loan' category.
The kernel density plots (diagonal subplots) show the distribution of each numerical variable, separated by 'Personal Loan' category.
This type of visualization can be useful for quickly identifying patterns or trends in the data, especially when exploring the relationships between multiple variables in the context of the target variable ('Personal Loan').
After thorough exploratory data analysis (EDA) and visualization, the decision is to exclude the following variables from further analysis and modeling: ID, Age (in years), Experience (in years), Income Categorical, ZIP Code, Family members, Securities Account, and CreditCard.
The decision to exclude these variables is driven by the goal of refining the feature set for more focused and effective predictive modeling. This process ensures that the features selected for the model are more likely to contribute meaningful information and enhance model performance.
The train-test split is a crucial step in the machine learning workflow that involves dividing our dataset into two subsets: one for training the model and another for evaluating its performance. This ensures that this model is tested on data it has never seen before, providing a more accurate assessment of its generalization capabilities.
y=df[['Personal Loan']]
x=df.drop(columns=['ID', 'Age (in years)','Experience (in years)','Income Categorical', 'ZIP Code', 'Family members','Personal Loan', 'Securities Account','CreditCard'])
y: This is our target variable, the variable we want to predict (in this case, 'Personal Loan'). It is kept in a DataFrame format using double brackets.
x: These are our features, the variables that will be used to make predictions. We have excluded certain columns (ID, Age, Experience, etc.) from the original DataFrame using the drop method.
Now x contains the features we'll use for training the machine learning model, and y contains the corresponding target variable. We can now split the data into training and testing sets and then train our model on the training set.
x.head(10)
| | Income (in K/year) | CCAvg | Education | Mortgage | TD Account | Online |
|---|---|---|---|---|---|---|
| 0 | 112 | 2.0 | 2 | 0 | 0 | 0 |
| 1 | 32 | 1.1 | 1 | 0 | 0 | 0 |
| 2 | 20 | 0.4 | 2 | 80 | 0 | 0 |
| 3 | 105 | 1.7 | 0 | 0 | 0 | 1 |
| 4 | 114 | 2.4 | 1 | 0 | 0 | 1 |
| 5 | 21 | 0.1 | 1 | 0 | 0 | 0 |
| 6 | 78 | 2.4 | 1 | 0 | 0 | 0 |
| 7 | 79 | 1.7 | 0 | 215 | 1 | 1 |
| 8 | 75 | 0.1 | 0 | 182 | 0 | 1 |
| 9 | 41 | 1.1 | 2 | 0 | 0 | 0 |
seed = 7  # Seed for reproducibility
# Splitting the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=seed)
Now we have x_train, x_test, y_train, and y_test ready for training and evaluating our machine learning model. We use x_train and y_train for training the model, and x_test and y_test for evaluating its performance.
Logistic Regression is a statistical method used for binary classification tasks, where the goal is to predict the probability that an instance belongs to a particular class. Despite its name, Logistic Regression is used for classification rather than regression.
# Instantiate a Logistic Regression model
Lr = LogisticRegression()
# Train the model on the training set
Lr.fit(x_train, y_train)
# Make predictions on the test set
y_predict = Lr.predict(x_test)
# Display the predicted values
print("y_predict is: ", y_predict)
# Evaluate the model's accuracy on the test set
model_score = Lr.score(x_test, y_test)
print("model_score is :", model_score)
C:\Intel\New folder\Lib\site-packages\sklearn\utils\validation.py:1184: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
y_predict is: [0 1 0 ... 0 0 0] model_score is : 0.9433333333333334
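The DataConversionWarning above appears because y was created with double brackets, making it a single-column DataFrame, so fit receives a column vector where scikit-learn expects a 1-D array. Flattening with ravel() (or selecting the column with single brackets as a Series) silences it. A NumPy-only illustration of the shape change:

```python
import numpy as np

y_col = np.array([[0], [1], [0], [1]])  # column vector, shape (4, 1): triggers the warning
y_flat = y_col.ravel()                  # 1-D array, shape (4,): what scikit-learn expects

print(y_col.shape, y_flat.shape)  # (4, 1) (4,)
```

In the notebook, passing y_train.values.ravel() to Lr.fit (or defining y = df['Personal Loan'] with single brackets) removes the warning without changing the results.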
Evaluating a classification model involves assessing its performance and understanding how well it generalizes to new, unseen data. Common metrics for evaluating classification models include accuracy, precision, recall, F1-score, and the confusion matrix
Accuracy: The proportion of correctly classified instances.
Precision: The proportion of true positive predictions among all positive predictions.
Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions among all actual positives.
F1-score: The harmonic mean of precision and recall, providing a balance between the two.
Confusion Matrix: A table showing the counts of true positive, true negative, false positive, and false negative predictions.
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_predict)
# Display the confusion matrix using ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot(cmap='Blues', values_format='d') # Use cmap of your choice and set values_format to 'd' for integer formatting
# Show the plot
plt.show()
# Print the confusion matrix
print("Confusion Matrix:")
print(cm)
Confusion Matrix:
[[1326   17]
 [  68   89]]
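The headline metrics can be recomputed by hand from this confusion matrix, which is a useful sanity check on the model's scores:

```python
# Entries read off the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 1326, 17
fn, tp = 68, 89

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # for class 1
recall = tp / (tp + fn)      # for class 1
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.9433 0.84 0.57 0.68
```

These match both the model_score printed earlier and the class-1 row of the classification report.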
Interpreting the confusion matrix and classification report is crucial to understanding how well a classification model is performing.
The classification report provides insight into how well the model performs for each class, considering both precision and recall. It also gives the overall accuracy, and the macro and weighted averages provide additional perspectives.
print(classification_report(y_test,y_predict))
precision recall f1-score support
0 0.95 0.99 0.97 1343
1 0.84 0.57 0.68 157
accuracy 0.94 1500
macro avg 0.90 0.78 0.82 1500
weighted avg 0.94 0.94 0.94 1500
Precision: For class 0 it is 0.95 and for class 1 it is 0.84, meaning 84% of the customers predicted to take a loan actually did.
Recall (Sensitivity or True Positive Rate): For class 0 it is 0.99, but for class 1 only 0.57, so the model misses a substantial share of actual loan-takers.
F1-score: The harmonic mean of precision and recall: 0.97 for class 0 and 0.68 for class 1.
Support: The number of actual occurrences of each class in the specified dataset. Class 0 has significantly more instances (1343) than class 1 (157).
Accuracy: The overall accuracy of the model is 94%. This is the ratio of correctly classified instances to the total instances.
Macro Avg: The macro average calculates metrics independently for each class and then takes the average. It gives equal weight to each class. In this case, the macro average is 0.90 for precision, 0.78 for recall, and 0.82 for F1-score.
Weighted Avg: The weighted average calculates metrics for each class, but it takes into account the relative number of instances of each class. In this case, the weighted average is 0.94 for precision, recall, and F1-score.
Our model performs well overall, with particularly high performance for class 0. The lower recall for class 1 may be a concern depending on the application's goals, since it means the model misses many customers who would actually take a loan. Even so, the model gives the bank reliable guidance about which customers it shouldn't consider for a personal loan offer.
Our suggestions aim not only to cater to specific customer groups but also to potentially increase the bank's revenue by selling more loans to these segments.
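One common way to raise recall on the minority class (loan-takers) is class weighting, which penalizes mistakes on class 1 more heavily during training. This is a possible next step, not something done in the project; a sketch on synthetic imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Synthetic imbalanced data: 90% class 0, 10% class 1, with shifted means
X0 = rng.normal(0.0, 1.0, size=(450, 2))
X1 = rng.normal(2.0, 1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 450 + [1] * 50)

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(class_weight='balanced')
model.fit(X, y)

recall_1 = model.predict(X[y == 1]).mean()  # fraction of true class-1 points recovered
print(recall_1)
```

On the HBFC data, the same one-line change (LogisticRegression(class_weight='balanced')) would trade some class-0 precision for higher class-1 recall; whether that trade is worthwhile depends on the bank's cost of a missed loan customer versus a wasted offer.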